Textmining and Organization in Large Corpus

Authors

Wei Ning

Jan Larsen

Abstract

A document corpus today commonly holds more than 5000 documents. It is almost impossible for a reader to read through all of them and find the relevant information within a couple of minutes. In this master's thesis project we propose text clustering as a potential solution for organizing large document corpora. Text mining, a sub-field of data mining, aims to discover useful information from written resources. Text clustering is one topic within text mining: it identifies group information in text documents and automatically assigns the documents to the most relevant groups. Representing the corpus as a term-document matrix is the prevalent preprocessing step in text clustering. If each unique term is taken as a dimension, a corpus of common size may contain more than ten thousand unique terms, which results in extremely high dimensionality. Finding good dimensionality reduction algorithms and suitable clustering methods is the main concern of this thesis project. We mainly compare two dimensionality reduction methods, Singular Value Decomposition (SVD) and Random Projection (RP), and three selected clustering algorithms: K-means, Non-negative Matrix Factorization (NMF) and Frequent Itemset. These methods and algorithms are compared on their clustering performance and time consumption. The project shows that K-means and Frequent Itemset can be applied to large corpora. NMF might need more research on speeding up its convergence.
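The pipeline the abstract outlines (term-document matrix, dimensionality reduction, clustering) can be illustrated with a minimal sketch. It is not the thesis implementation: the whitespace tokenizer, the Gaussian projection matrix, the toy documents, and the function names `term_document_matrix`, `random_projection` and `kmeans` are all assumptions made for illustration, and only the Random Projection and K-means steps are shown.

```python
import math
import random
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, rows); rows[d][t] is the count of term t in document d."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0.0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = float(count)
        rows.append(row)
    return vocab, rows


def random_projection(rows, k, rng):
    """Project each document vector onto k random Gaussian directions."""
    n_terms = len(rows[0])
    # Entries drawn from N(0, 1) and scaled by 1/sqrt(k): one common choice.
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)]
            for _ in range(n_terms)]
    return [[sum(row[t] * proj[t][j] for t in range(n_terms)) for j in range(k)]
            for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, rng, n_iter=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest center.
        labels = [min(range(n_clusters),
                      key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "text mining finds patterns in text",
    "clustering groups similar text documents",
    "the protein binds the gene promoter",
    "gene expression and protein folding",
]
rng = random.Random(0)
vocab, rows = term_document_matrix(docs)
reduced = random_projection(rows, 3, rng)
print(kmeans(reduced, 2, rng))
```

On a corpus this small the projection is unnecessary; its purpose only becomes visible when the vocabulary reaches the thousands of terms the abstract mentions, where clustering in the reduced space is much cheaper than in the full term space.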

Similar resources

GENIA corpus - a semantically annotated corpus for bio-textmining

MOTIVATION Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this literature, however, causes a major bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials to let NLP techniques work for bio-textmining. RESULTS G...

Textmining: Generating association rules from textual data

Textmining is an emerging research area whose goal is to discover additional information from hidden patterns in large unstructured textual collections. Hence, given a collection of text documents, most text mining approaches perform knowledge-discovery operations on labels associated with each document, which are usually keywords that represent the result of non-trivial keyword-labeling pro...

Large Sphenoethmoidal Encephalocele Associated with Agenesis of Corpus Callosum and Cleft Palate

Basal encephalocele is a rare craniofacial anomaly. In the present paper we report a 10-year-old boy who presented with cleft palate, congenital nystagmus, and hypertelorism. During preoperative evaluation for cleft palate repair, a pulsatile mass was detected in the pharynx. Magnetic resonance imaging showed a sphenoethmoidal type of basal encephalocele and agenesis of the corpus callosum. Neurosurgical...

Modeling text with generalizable Gaussian mixtures

We apply and discuss generalizable Gaussian mixture (GGM) models for textmining. The model automatically adapts model complexity for a given text representation. We show that the generalizability of these models depends on the dimensionality of the representation and the sample size. We discuss the relation between supervised and unsupervised learning in text data. Finally, we implement a novel...

A symbolic approach to automatic multiword term structuring

This paper presents a three-level structuring of multiword terms (MWTs) based on lexical inclusion, WordNet similarity and a clustering approach. Term clustering by automatic data analysis methods offers an interesting way of organizing a domain’s knowledge structures, useful for several information-oriented tasks like science and technology watch, textmining, computer-assisted ontology popula...

Publication date: 2005
